1.  SUMMARY

This  paper  describes  a system  for  the  automatically  learned 
partitioning  of  "visual  patterns"  in  digital  images,  based  on  a 
sophisticated,  band-pass,  filtering  operation,  with  fixed  scale 
and  orientation  sensitivity.  The  "visual  patterns"  are  defined 
as  the  features  which  have  the  highest  degree  of  alignment  in 
the  statistical  structure  across  different  frequency  bands.  Here 
we  show  a computational  visual  distinctness  measure 
computed  from  the  image  representational  model  based  on 
visual  patterns.  It  is  applied  to  quantify  the  visual  distinctness 
of  targets  in  complex  natural  scenes.  We  also  investigate  the 
relation  between  the  computational  distinctness  measure  and 
the  visual  target  distinctness  measured  by  human  observers. 

2.  INTRODUCTION 

Images  issued  from  the  environment  should  not  be  presumed 
to  be  random  patterns.  Instead,  real-world  images  contain 
characteristic  statistical  regularities  that  set  them  apart  from 
purely  random  images.  There  are  a number  of  statistical 
properties  that  we  might  consider  when  looking  at  real-world 
images,  and  many  of  the  important  forms  of  structure  that  are 
contained  in  2D  images  require  higher-order  statistics 
characterization.  Moreover,  Field  [1]  noted  that  there  is 
likely  to  be  a variety  of  features  which  extend  across  different 
frequency  bands.  For  instance,  the  presence  of  edges  and  lines 
in  an  image  corresponds  to  a type  of  congruence  between  the 
different  scales  of  the  image  which  is  destroyed  when  the 
phases  are  randomized  [2].  These  features  exist  because  some 
degree  of  alignment  exists  between  the  phases  at  different 
frequencies.  There  are  also  other  forms  of  congruence  across 
scales in 2D digital images. Field [3] suggested that the
power spectra of natural images fall off as a function of
frequency by a factor of approximately 1/k^2. This implies that
the image will have constant variance across scales: the
contrast as measured by the variance in pixel intensities should
remain roughly constant, independently of the viewing
distance.

The perceptual organization capabilities of human vision seem
to exhibit the properties of detecting viewpoint-invariant
structures and calculating varying degrees of significance for
individual instances [4]. Lowe [5] proposed that the structures
to be detected in the image should be formed bottom-up using
perceptual grouping operations that exhibit exactly these
properties in the absence of domain knowledge, yet must be of
sufficient specificity to serve as indexing terms into a
database of objects. Given that we often have no a priori
knowledge of viewpoint for the objects in a database, these
indexing features that are detected in the image must reflect
properties of the objects that are at least partially
invariant over a wide range of viewpoints of some
corresponding  three-dimensional  structure.  This  means  that  it 
is  useless  to  look  for  features  with  particular  sizes  or 
orientations  or  other  properties  that  are  highly  dependent  upon 
viewpoint.  The  second  constraint  on  these  indexing  features  is 
that  there  must  be  some  way  to  distinguish  the  relevant 
features  from  the  dense  background  of  other  image  features 
which  could  potentially  give  rise  to  false  instances  of  the 
structures. 
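
As a brief check of the claim made earlier in this section that a
1/k^2 power spectrum gives constant variance across scales, the
energy falling in any one-octave band can be integrated directly.
The sketch below assumes an isotropic spectrum $S(k) = A/k^2$; the
constant $A$ and the band limits $k_0$ and $2k_0$ are notation
introduced here for illustration only.

```latex
\begin{aligned}
\mathcal{E}(k_0)
  &= \int_{0}^{2\pi}\!\!\int_{k_0}^{2k_0} \frac{A}{k^{2}}\, k \, dk \, d\theta
   = 2\pi A \int_{k_0}^{2k_0} \frac{dk}{k}
   = 2\pi A \ln 2 .
\end{aligned}
```

The result does not depend on $k_0$, so every octave band carries
the same energy. This is the sense in which contrast stays roughly
constant when the viewing distance, and hence the scale, changes.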

Often  implicit  in  the  interpretation  of  visual  search  tasks  is  the 
assumption  that  the  detection  of  targets  is  determined  by  the 
feature-coding properties of low-level visual processing [6].
Instead  of  assuming  that  perceived  shapes  are  simple  or 
statistical  structure  at  a particular  scale,  we  think  it  more 
appropriate  to  regard  them  as  "visual  patterns",  distinguished 
at  an  object  level. 

Here  we  show  a particular  scheme  for  filtering  observed 
images, designed for the automatically learned partitioning of
features  (visual  patterns)  which  have  the  highest  degree  of 
alignment  in  statistical  structure  across  different  frequency 
bands.  These  features  are  likely  to  be  invariant  over  a range  of 
scales  and  orientations  and  can  be  judged  unlikely  to  be 
accidental  in  origin  even  in  the  absence  of  specific  information 
regarding  which  objects  may  be  present.  Then,  we  present  a 
computational  visual  distinctness  measure  computed  from  the 
image  representational  model  based  on  visual  patterns.  This 
measure  applies  a simple  decision  rule  to  the  distances 
between  segregated  visual  patterns,  and  it  will  be  used  to 
quantify  the  visual  distinctness  of  targets  in  complex  natural 
scenes. The analysis for the automatically learned partitioning
of "visual patterns" (termed the RGFF model) follows
three  stages:  Preattentive  stage,  Integration  stage,  and 
Learning  stage.  Fig.  1 shows  a general  diagram  describing 
how  the  data  flows  through  the  RGFF  model.  This  diagram 
illustrates  the  analysis  on  a given  image  of  a military  vehicle 
in  a complex  rural  background. 

In  the  preattentive  stage  of  the  RGFF  system  (Section  3),  the 
clumps  of  energy  in  the  Fourier  spectrum  of  the  image  are 
captured  into  a collection  of  oriented  spatial-frequency 
channels, as illustrated in Fig. 1. The segregation of these
clumps  of  energy  induces  the  selection  of  a subset  of 
activated  filters  (which  are  selectively  sensitive  to  them)  from 
a filter  bank  of  log-Gabor  functions  centered  at  12  orientations 
and  5 ranges.  Due  to  conjugate  symmetry,  the  filter  design  is 
only  carried  out  on  half  the  2D  frequency  plane.  The  activated 
log-Gabor  filters  produced  by  the  preattentive  stage  are 
illustrated in the diagram by ellipses drawn, in the 2D
spatial-frequency plane, at the point where their amplitude has
decreased to $e^{-1/2}$ of its maximum.

Figure 1: A general diagram describing how the data flows
through the representational model.

In the integration stage (Section 4), for any two activated
filters, their responses are compared based on the distance (a
P-norm) between their statistical structure, computed over
those pixels which form "fixation points" of the filters (local
energy peaks on the filtered response).

In  the  learning  stage  (Section  5),  clustering  on  the  basis  of  the 
distance  between  the  activated  filters  is  performed  to  highlight 
scale  and  orientation  invariance  of  responses. 

As  shown  in  Fig.  1,  three  collections  of  filters  were  obtained 
in  the  Learning  stage  for  the  input  image  in  accordance  with  a 
constraint  of  invariance  in  statistical  structure  across 
frequency  bands.  The  filtered  responses  of  activated  log- 
Gabors  in  each  one  of  the  three  groupings  were  summed  for 
the  automatic  learned  partitioning  of  the  visual  patterns. 

The  performance  of  this  notion  of  visual  pattern  to  segregate 
potential  targets  can  be  visually  evaluated  in  Fig.  1,  at  the 
bottom.  The  dominant  signal  in  the  output  from  detector  #2  is 
the  military  vehicle  (target)  which  is  well  preserved.  On  the 


contrary,  both  large  structures  and  fine  detail  of  the  natural 
background  were  removed,  even  though  significant 
background  clutter  that  can  affect  the  target  distinctness  is  still 
present.  In  fact,  the  fine  details  of  the  natural  background, 
which  are  not  significant  for  quantifying  the  target 
distinctness,  are  isolated  in  the  output  from  detector  #1.  And 
the  lower  frequency  texture  of  the  background  is  segregated 
into  the  output  from  detector  #3.  Fig.  2 demonstrates  the 
ability  of  the  same  model  to  achieve  signal  separation  from 
superposition  of  objects  on  three  synthetic  images.  The  image 
in Fig. 2.A1 was partitioned into two "visual patterns", as
shown in Figs. 2.C1 and 2.F1. In the learning stage, the set of
activated filters was partitioned into two groupings of filters,
as shown in Figs. 2.B1 and 2.D1. The "visual pattern" shown
in Fig. 2.C1 (respectively, Fig. 2.F1) was obtained by the sum
of the responses over filters in Fig. 2.B1 (resp., Fig. 2.D1).
The "visual patterns" obtained by the model on Fig. 2.A2 are
illustrated in Figs. 2.C2 and 2.F2. The learning stage produced
two collections of filters as shown in Figs. 2.B2 and 2.D2.

The right column in Fig. 2 shows the signal separation
achieved by the analysis on the input image given in Fig. 2.A3.

Finally, Section 6 presents the computational visual
distinctness measure computed from the image
representational model based on visual patterns. As illustrated
in Fig. 10, this measure applies a simple decision rule to the
distances between segregated visual patterns.

3.  PREATTENTIVE  STAGE 

In  the  RGFF  model,  the  encoding  strategy  will  rely  on  the 
combined  activity  of  subsets  of  filters.  Only  a small  number  of 
units will contribute to the detection of each visual pattern.
These  collections  of  filters  will  be  derived  from  a learning 
stage,  based  on  the  degree  of  congruence  between  the 
responses  of  strongly  responding  filters  that  a preattentive 
stage  produces.  There  are  two  basic  assumptions  for  this  first 
stage: 

1. Spatial information on the image is analyzed by multiple
filters, each of which is sensitive to patterns whose spatial
frequencies are in a particular range.

2. The RGFF model bases its responses only on those
filters sensitive to relevant forms in the complex scene.

These assumptions are in agreement with models of
spatial-frequency channels which are quite successful for the
detection of visual patterns [7]. The output of the preattentive
stage will be the units from a fixed filter bank of log-Gabors
which are tuned to the clumps of energy in the Fourier
spectrum of the given image. The selected units are the filters
in the bank which strongly respond to some pattern that the
image contains. These filters are referred to as the "activated"
filters  of  the  bank.  Also  for  each  activated  filter,  pixels 
whereupon  the  focus  of  attention  should  be  shifted  to  measure 
congruence  and  which  form  "fixation  points",  are  computed  as 
local  energy  peaks  on  the  filtered  response.  This  processing  is 
based  on  current  models  of  human  visual  search  and  detection 
which  assume  that  a preattentive  stage  indicates  potentially 
interesting  image  regions,  and  where  a serial  stage  is 
deployed to analyze them in detail [6,7].

3.1.  Bank  of  filters 

The set of filters used in the decomposition of the picture
consists of log-Gabor filters of different spatial frequencies
and orientations [3]. Log-Gabor functions, by definition, have
no DC component. The transfer function of the log-Gabor has
extended tails at the high frequency end. Thus it should be
able to encode natural images more efficiently than ordinary
Gabor functions, which would over-represent the low-frequency
components and under-represent the high frequency
components in any encoding process. Another argument in
support of the log-Gabor functions is the consistency with
measurements on the mammalian visual system [8].

Figure 2: Automatically learned partitioning of "visual
patterns" in synthetic image data.

A log-Gabor filter determines a Gaussian in the spatial
frequency domain around some central frequency $(r_0, \theta_0)$.
It can be represented in the frequency domain as the sum of the
even-symmetric log-Gabor filter and $i$ times the
odd-symmetric log-Gabor filter as follows:

$$\phi_{(r_0,\theta_0)}(r,\theta) = \exp\left\{-\frac{(\log(r/r_0))^2}{2(\log(\sigma_r/r_0))^2}\right\} \exp\left\{-\frac{(\theta-\theta_0)^2}{2\sigma_\theta^2}\right\} \qquad (1)$$

where $\theta_0$ is the orientation angle of the filter, $r_0$ is
the central radial frequency, and $\sigma_\theta$ and $\sigma_r$ are
the angular and radial sigma of the Gaussian, respectively.
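
Equation (1) can be evaluated pointwise in the frequency plane.
The following is a minimal Python sketch, not the authors'
implementation: the function name `log_gabor`, the wrapping of
the angular difference, and the parameter values in the usage
note are assumptions introduced for illustration.

```python
import math

def log_gabor(r, theta, r0, theta0, sigma_r, sigma_theta):
    """Value of the log-Gabor transfer function of equation (1)
    at polar frequency (r, theta)."""
    if r <= 0.0:
        return 0.0  # log-Gabor functions have no DC component
    # Radial term: Gaussian on a logarithmic frequency axis.
    radial = math.exp(-(math.log(r / r0) ** 2) /
                      (2.0 * math.log(sigma_r / r0) ** 2))
    # Assumption: wrap the angular difference into [-pi, pi) so that
    # orientations on either side of the cut are treated as neighbours.
    dtheta = (theta - theta0 + math.pi) % (2.0 * math.pi) - math.pi
    angular = math.exp(-(dtheta ** 2) / (2.0 * sigma_theta ** 2))
    return radial * angular
```

For example, with a hypothetical ratio `sigma_r / r0 = 0.65`,
`log_gabor(r0, theta0, r0, theta0, 0.65 * r0, math.radians(15))`
returns the peak value of the filter, and the response at `2 * r0`
equals the response at `r0 / 2`, which reflects the symmetry of
the radial term on a logarithmic axis.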

The  convolution  of  a log-Gabor  function  (whose  real  and 
imaginary  parts  are  in  quadrature)  with  a real  image  results  in 
a complex  image.  Its  norm  is  called  energy  and  its  argument  is 
called  phase.  The  local  energy  of  the  image  analyzed  by  a 
log-Gabor  filter  (hereafter,  filtered  response)  can  be  expressed 
as  [3]: 


$$E(x,y) = \sqrt{O_{even}^2(x,y) + O_{odd}^2(x,y)} \qquad (2)$$

where $O_{even}(x,y)$ is the image convolved with the
even-symmetric log-Gabor filter and $O_{odd}(x,y)$ is the image
convolved with the odd-symmetric log-Gabor filter.
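
Given the two quadrature responses, the energy of equation (2)
and the phase (the argument of the complex response) follow
directly. The sketch below is illustrative only; the function
names are hypothetical, the responses are plain nested lists, and
the four-quadrant `atan2` is used for the argument.

```python
import math

def local_energy(o_even, o_odd):
    """Norm of the complex response o_even + i*o_odd (equation (2))."""
    return math.hypot(o_even, o_odd)

def local_phase(o_even, o_odd):
    """Argument of the complex response o_even + i*o_odd."""
    return math.atan2(o_odd, o_even)

def energy_map(even_map, odd_map):
    """Apply local_energy pixel by pixel to two equally sized 2D lists."""
    return [[local_energy(e, o) for e, o in zip(row_e, row_o)]
            for row_e, row_o in zip(even_map, odd_map)]
```

For instance, a pixel with even response 3 and odd response 4 has
local energy 5.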

The  real-valued  function  given  in  equation  (1)  can  be 
multiplied  by  the  frequency  representation  of  the  image  and 
after  transforming  the  result  back  to  the  spatial  domain,  the 
results  of  applying  the  oriented  energy  filter  pair  are  extracted 
as  simply  the  real  component  for  the  even-symmetric  filter  and 
the imaginary component for the odd-symmetric filter [9]. The
bank  of  the  filters  should  be  designed  so  that  it  tiles  the 
frequency  plane  uniformly  (the  transfer  function  should  be  a 
perfect  bandpass  function).  The  length  to  width  ratio  of  the 
filters  controls  their  directional  selectivity.  The  ratio  can  be 
varied  in  conjunction  with  the  number  of  orientations  used  in 
order  to  achieve  a coverage  of  a 2D  spectrum.  Furthermore,  as 
the  degree  of  blurring  introduced  by  the  filters  increases  with 
their  orientational  selectivity,  they  must  be  carefully  chosen 
to  minimize  the  blurring.  Hence  we  consider  a filter  bank 
with  the  following  features: 

1. The spatial frequency plane is divided into 12 different
orientations.

2. The radial axis is divided into 5 equal octave bands. In a
band of width 1 octave, spatial frequency increases by a
factor of 2. The highest filter (for each direction) is positioned
near the Nyquist frequency to avoid ringing and noise.
The wavelengths of the five filters in each direction are set at
3, 6, 12, 24, and 48 pixels, respectively.

3. The radial bandwidth is chosen as 1.2 octaves.

4. The angular bandwidth is chosen as 15 degrees.

Twelve  different  angles  for  each  resolution  are  chosen  and 
five  different  resolutions  are  used.  The  resultant  filter  bank  is 
illustrated in Fig. 1. Due to conjugate symmetry, the filter
design is only carried out on half the 2D frequency plane. The
log-Gabor filters are illustrated in the diagram by ellipses
drawn, in the 2D spatial-frequency plane, at the point where
their amplitude has decreased to $e^{-1/2}$ of its maximum.
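
The layout of the bank described above (five wavelengths, twelve
orientations on half the frequency plane) can be enumerated as
follows. This is a sketch under assumptions: orientations are
taken to start at 0 and to be evenly spaced by 180/12 = 15
degrees, center frequencies are expressed in cycles per pixel as
the reciprocal of the wavelength, and the name
`filter_bank_centers` is hypothetical.

```python
import math

WAVELENGTHS = (3, 6, 12, 24, 48)  # pixels, one per octave band
N_ORIENTATIONS = 12               # over half the frequency plane

def filter_bank_centers():
    """Return the (r0, theta0) center of every filter in the bank."""
    step = math.pi / N_ORIENTATIONS  # assumed even spacing, 15 degrees
    centers = []
    for wavelength in WAVELENGTHS:
        r0 = 1.0 / wavelength        # cycles per pixel
        for k in range(N_ORIENTATIONS):
            centers.append((r0, k * step))
    return centers
```

The list holds 5 x 12 = 60 centers, and consecutive bands differ
by a factor of 2 in center frequency, that is, by one octave.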

3.2.  Activated  filters  in  the  bank 

In  order  to  decompose  the  image  into  its  most  significant 
components,  strongly  responding  filters  should  be  selected  for 
the  input  image. 

Let  Active  be  the  set  of  filters  in  the  bank  that  strongly 
respond  to  the  spatial  information  content.  They  will  be 
selectively  sensitive  to  patterns  in  the  scene.  These  patterns 
produce  clumps  of  energy  upon  the  Fourier  spectrum  of  the 
image.  The  activated  units  from  the  bank  are  then  simply 
those  filters  whose  amplitude  spectrum  and  some  clump  of 
energy  in  the  image  amplitude  spectrum  overlap  to  some 
extent, as illustrated in Fig. 1.

3.3.  Selection  of  fixation  points 

In  the  integration  stage,  for  any  given  two  activated  filters,  a 
distance  between  them  is  derived  via  distances  between  their 
statistics.  The  distance  chosen  is  the  P-norm,  computed  over 
those  pixels  which  form  "fixation  points"  of  the  filters.  The 
"fixation  points"  are  simply  local  energy  peaks  on  the  filtered 
response.  The  standard  argument  for  selecting  regions  of  high 
Gabor  energy  is  that  they  would  provide  a good  starting  point 
for  exploring  common  grounds  between  several  activated 
filters in the Gabor space. The implementation of the
local-energy model used here is the one presented in [10]. Given
the original image, the local energy map $E_i$ for the activated
filter $\phi_i$, given in equation (2), yields a representation in
the space spanned by two functions, $O_{even}(x,y)$ and
$O_{odd}(x,y)$, where $O_{even}(x,y)$ is the image convolved with
the even-symmetric log-Gabor filter and $O_{odd}(x,y)$ is the
image convolved with the odd-symmetric log-Gabor filter at $(x,y)$.
Hence, the detection of peaks on the $E_i$ map acts as a detector
of significant features on the filtered response.
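
Peak detection on an energy map can be sketched as a search for
strict local maxima. The code below is a simple stand-in and not
the local-energy implementation of [10]: the 8-neighbourhood, the
strictness of the comparison, and the name `fixation_points` are
assumptions made for illustration.

```python
def fixation_points(energy):
    """Return (row, col) positions that are strict local maxima of a
    2D energy map, using the 8-neighbourhood of each pixel."""
    rows, cols = len(energy), len(energy[0])
    peaks = []
    for y in range(rows):
        for x in range(cols):
            centre = energy[y][x]
            # Collect the neighbours that lie inside the map.
            neighbours = [energy[j][i]
                          for j in range(max(0, y - 1), min(rows, y + 2))
                          for i in range(max(0, x - 1), min(cols, x + 2))
                          if (j, i) != (y, x)]
            if neighbours and all(centre > v for v in neighbours):
                peaks.append((y, x))
    return peaks
```

On a small map with two isolated bumps, the function returns the
two bump positions and ignores flat regions, because a plateau has
no strict maximum.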

4.  INTEGRATION  STAGE 

Given a decomposition of the original image into its most
significant components, only one further element is needed to
define the concept of visual pattern: a distance measure,
denoted as $\mathrm{Distance}(\phi_i, \phi_j)$, between the
statistical structures of the filtered responses for each pair of
filters $\phi_i$ and $\phi_j$ (Section 4.2). Then,
$\mathrm{Distance}(\phi_i, \phi_j)$ returns a value of the
degree of congruence between statistical structure at different
scales and orientations.

There are two basic assumptions for measuring congruence
between two filtered responses in this second stage:

1. The similarity between two filtered responses can be
measured by the Quick pooling of the differences between
their statistical structure.

2. The measure of similarity is not simply computed globally
over the entire filtered response, but semi-locally at locations
that are local energy peaks (fixation points).

Previously, it was demonstrated [11] that a measure based on
these two assumptions produces a good predictor of target
saliency for humans performing visual search and detection
tasks.


4.1.  Definition of integral feature for the partitioning of
visual patterns

For each activated filter $\phi_i$, the respective filtered response
may be represented by any subset of the following separable
features:

1. The phase value defined as:

$$T_1^i(x,y) = \arctan\frac{O_{odd}(x,y)}{O_{even}(x,y)} \qquad (3)$$

where $O_{even}(x,y)$ is the image convolved with the
even-symmetric log-Gabor filter of $\phi_i$, and $O_{odd}(x,y)$ is
the image convolved with the odd-symmetric log-Gabor filter of
$\phi_i$ at $(x,y)$ (Section 3.1).

2. A normalized measure of local energy as given by:

$$T_2^i(x,y) = \frac{E_i(x,y)}{\sum_{\phi_j \in Active} E_j(x,y)} \qquad (4)$$

where $E_i(x,y)$ denotes the local energy at $(x,y)$ for filter
$\phi_i$ (see equation (2) for further details), and Active is the
set of activated filters for the image. This definition of a
normalized local energy incorporates lateral interactions among
activated filters to account for between-filter masking.

3. The local standard deviation of the normalized local
energy defined as:

$$T_3^i(x,y) = \sqrt{\frac{1}{\mathrm{Card}[W(x,y)]} \sum_{(p,q) \in W(x,y)} \left(T_2^i(p,q) - \mu\right)^2} \qquad (5)$$

where

$$\mu = \frac{1}{\mathrm{Card}[W(x,y)]} \sum_{(p,q) \in W(x,y)} T_2^i(p,q)$$

and $T_2^i(p,q)$ as given in equation (4). The neighborhood
$W(x,y)$ is defined as the set of pixels contained in a disk of
radius $r$ centered at $(x,y)$. Let $r$ be defined as the Euclidean
distance between $(x,y)$ and the nearest local minimum to $(x,y)$
on the energy map $E_i$. Since the nearest local minimum to
$(x,y)$ on the local energy map marks the beginning of another
potential structure, our selection for the neighborhood $W(x,y)$
avoids interference with such a structure while the local
variation is computed [10].

4. The local contrast of the normalized local energy defined
as:

$$T_4^i(x,y) = \frac{T_3^i(x,y)}{\mu} \qquad (6)$$

where

$$\mu = \frac{1}{\mathrm{Card}[W(x,y)]} \sum_{(p,q) \in W(x,y)} T_2^i(p,q)$$

5. The local entropy of the normalized local energy within
$W(x,y)$, noted as $T_5^i(x,y)$.

Although we propose these five features, any other feature
intended to capture relevant characteristics of the scene, while
remaining stable for the representation of the image, is also
conceivable. Hereafter, an "integral feature" is defined as a
particular subset of separable features at a fixation point [12].

For representing the filtered responses of the input image,
different definitions of integral feature can be given based on
different subsets of separable features. Consequently, the
system should learn the best integral feature definition for the
input image in which to look for invariance across orientations
and scales. This point is analyzed in Section 6.4.

4.2.  Congruence in integral features between two filtered
responses

In order to define a distance between the integral features of
two filtered responses, we need to specify how the differences
in each separable feature are to be pooled into an overall
difference at fixation points.

Let $\phi_i$ and $\phi_j$ be a pair of activated filters in Active.
Let $T^i(x,y) = (T_{l_k}^i(x,y))_{1 \leq k \leq L}$, with
$l_k \in \{1, 2, \ldots, 5\}$, be the integral feature at $(x,y)$
computed on the filtered response of $\phi_i$, based on a number of
$L$ separable features (Section 4.1). In a similar way, let
$T^j(x,y) = (T_{l_k}^j(x,y))_{1 \leq k \leq L}$ be the integral
feature at $(x,y)$ on the filtered response of $\phi_j$.

We take $D[T^i(x,y), T^j(x,y)]$ defining a distance measure
between integral features $T^i(x,y)$ and $T^j(x,y)$ as given by the
equation:

$$D[T^i(x,y), T^j(x,y)] = \sum_{k=1}^{L} \frac{1}{Max_{l_k}} \, d\left(T_{l_k}^i(x,y), T_{l_k}^j(x,y)\right) \qquad (7)$$

where normalization $Max_{l_k}$ is defined as:

$$Max_{l_k} = \max_{n,m \,:\, \phi_n, \phi_m \in Active} \left\{ d\left(T_{l_k}^n(p,q), T_{l_k}^m(p,q)\right) \mid (p,q) \in FP(n) \right\}$$

with $FP(n)$ being the fixation points for the activated filter
$\phi_n$ and Active being the set of activated filters; and where
for $l_k = 1$, we have:

$$d\left(T_1^i(x,y), T_1^j(x,y)\right) = \left| \arctan \frac{\sin\left(T_1^i(x,y) - T_1^j(x,y)\right)}{\cos\left(T_1^i(x,y) - T_1^j(x,y)\right)} \right| \qquad (8)$$

and for $l_k = 2, 3, 4, 5$:

$$d\left(T_{l_k}^i(x,y), T_{l_k}^j(x,y)\right) = \left| T_{l_k}^i(x,y) - T_{l_k}^j(x,y) \right| \qquad (9)$$
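
Equations (7)-(9) combine per-feature differences into a single
distance at one fixation point. The sketch below follows them
literally, including the single-argument arctangent in the phase
term, which wraps phase differences with period pi (a four-quadrant
`atan2` would wrap with period 2*pi instead). The function names,
the dictionary representation of an integral feature, and the
externally supplied normalization values are assumptions.

```python
import math

def feature_distance(index, a, b):
    """d(.,.) of equations (8) and (9) for separable feature `index`."""
    if index == 1:
        # Phase term, equation (8): |arctan(sin(a - b) / cos(a - b))|.
        diff = a - b
        c = math.cos(diff)
        if c == 0.0:
            return math.pi / 2.0  # limit of |arctan| as the ratio diverges
        return abs(math.atan(math.sin(diff) / c))
    # Features 2..5, equation (9): plain absolute difference.
    return abs(a - b)

def integral_distance(indices, t_i, t_j, max_norm):
    """D of equation (7). `t_i`, `t_j` and `max_norm` map a feature
    index to its value; `max_norm` is assumed to be precomputed."""
    return sum(feature_distance(l, t_i[l], t_j[l]) / max_norm[l]
               for l in indices)
```

With features 1 and 2 selected, values `{1: 0.3, 2: 0.7}` and
`{1: 0.1, 2: 0.2}`, and normalizations `{1: 0.4, 2: 1.0}`, the two
terms are 0.2/0.4 and 0.5/1.0, so the distance is approximately 1.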

The congruence in integral features between two filtered
responses is computed by using Quick pooling [13]. It is the
most common model of integration over spatial extent, and is
essentially the square root of the sum of squares except that the
exponent is not restricted to the value of 2. The Quick pooling
can be viewed as a metric in a multidimensional space, and it
is sometimes known as the Minkowski metric.

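The distance between the filtered responses of $\phi_i$ and
$\phi_j$, which provides a measure of the extent to which features
extend through frequency, is given by:

$$\mathrm{Distance}(\phi_i, \phi_j) = \mathrm{Dist}^2[i,j] + \mathrm{Dist}^2[j,i] \qquad (10)$$

where $\mathrm{Dist}[p,q]$ is obtained by Quick pooling, with
exponent $\beta$ (equation (11)), of $D[T^p(x,y), T^q(x,y)]$ over
the points $(x,y) \in FP(p)$, with $FP(p)$ being the set of
fixation points for the activated filter $\phi_p$, and where
$D[T^p(x,y), T^q(x,y)]$ is defined as given in equation (7).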

The default value of the exponent $\beta$ in equation (11) is 3.
Graham [7] discussed at some length several interpretations
of the Quick pooling formula and the selection of the pooling
exponent.
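
Quick (Minkowski) pooling and the symmetric combination of the
two directed distances can be sketched as below. This is an
illustration under stated assumptions: the directed distance is
taken to be the unnormalized pooling of the per-point distances
$D$ over the fixation points of one filter, and any further
normalization in the original formula is not reproduced. The
function names are hypothetical.

```python
def quick_pool(values, beta=3.0):
    """Minkowski (Quick) pooling: (sum of |v|**beta) ** (1 / beta)."""
    return sum(abs(v) ** beta for v in values) ** (1.0 / beta)

def directed_dist(point_distances, beta=3.0):
    """Assumed form of Dist[p, q]: pooling of D over the points of FP(p)."""
    return quick_pool(point_distances, beta)

def symmetric_distance(dist_ij, dist_ji):
    """Combination of both directions: Dist^2[i,j] + Dist^2[j,i]."""
    return dist_ij ** 2 + dist_ji ** 2
```

With `beta=2` the pooling reduces to the square root of the sum of
squares, so `quick_pool([3, 4], beta=2)` is 5. Larger exponents
weight the largest differences more heavily, and the pooled value
approaches the maximum as `beta` grows.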


5.  LEARNING  STAGE 

Based on a measure of the extent to which features extend
through frequency, noted as $\mathrm{Distance}(\phi_i, \phi_j)$, a
"visual pattern" is simply defined as congruence in statistical
structure, as measured by Distance, across a range of 2D spatial
frequency bands.

The individual filters spanning this particular range of bands
will determine a natural cluster of units, noted as $C_n$, in the
set of activated log-Gabors Active. By taking into account the
statistical congruence across this range of frequency bands, a
pair of filters $\phi_i$ and $\phi_j$ will belong to the same
natural cluster $C_n$ if there exists a certain continuity (i.e.,
there exists similarity in some statistics at the same spatial
locations) across the filtered responses for an intermediate
sequence of filters, between $\phi_i$ and $\phi_j$, in $C_n$.

Therefore the definition of "visual pattern" induces a partition
in Active into a number of natural clusters $C_1, C_2, \ldots, C_N$
such that:

$$Active = \bigcup_{n=1}^{N} C_n, \quad \text{and} \quad C_p \cap C_q = \emptyset, \quad \text{with } p \neq q, \; p,q = 1, 2, \ldots, N \qquad (12)$$

where, for each $C_n$, a pair of filters $\phi_i, \phi_j \in C_n$ if
there exists a sequence of filters $\phi_{m_1}, \ldots, \phi_{m_l}$
in $C_n$ such that

$$\begin{aligned}
\mathrm{Distance}(\phi_i, \phi_{m_1}) &\leq \varepsilon_n \\
\mathrm{Distance}(\phi_{m_l}, \phi_j) &\leq \varepsilon_n \\
\mathrm{Distance}(\phi_{m_k}, \phi_{m_{k+1}}) &\leq \varepsilon_n, \quad k = 1, 2, \ldots, l-1
\end{aligned} \qquad (13)$$

where $\varepsilon_n$ denotes the degree of statistical congruence
between a pair of filters in $C_n$ and verifies that:

$$\mathrm{Distance}(\phi_p, \phi_q) > \varepsilon_n \quad \forall \phi_p, \phi_q : \phi_p \in C_n, \; \phi_q \in Active - C_n \qquad (14)$$
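
The chain condition above means that two filters share a cluster
whenever a sequence of small steps links them, even if their
direct distance is large. The sketch below illustrates this with
connected components of a thresholded distance matrix. It is a
simplification: a single threshold `eps` is used for all clusters,
whereas the text allows a different value per cluster, and the
name `chain_clusters` is hypothetical.

```python
def chain_clusters(dist, eps):
    """Group items so that i and j share a cluster if and only if a
    chain of steps, each with dist <= eps, links them."""
    n = len(dist)
    label = [None] * n
    clusters = []
    for start in range(n):
        if label[start] is not None:
            continue
        label[start] = len(clusters)
        members, stack = [], [start]
        while stack:
            i = stack.pop()
            members.append(i)
            # Absorb every unlabeled item within eps of the current one.
            for j in range(n):
                if label[j] is None and dist[i][j] <= eps:
                    label[j] = label[start]
                    stack.append(j)
        clusters.append(sorted(members))
    return clusters
```

In a four-item example where items 0 and 2 are 0.5 apart but both
lie within 0.2 of item 1, a threshold of 0.3 places 0, 1 and 2 in
one cluster and leaves item 3, which is at least 0.9 away from all
of them, on its own.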

The  clustering  of  activated  filters  is  performed  as  described  in 
Section  5.1.  Figs.  3 and  4 illustrate  the  performance  of  the 
clustering  on  several  images  of  a target  in  a complex  rural 
background. 

The image in Fig. 3.A1 was partitioned into the two visual
patterns shown in Figs. 3.C1 and 3.E1.

In the clustering process, the set of activated filters was
partitioned into two collections of filters, as shown in Figs.
3.B1 and 3.D1. The "visual pattern" shown in Fig. 3.C1
(respectively, Fig. 3.E1) was obtained by the sum of the
responses over filters in Fig. 3.B1 (resp., Fig. 3.D1).

The right column in Fig. 3 shows the separation achieved by
the analysis on the image in Fig. 3.A2.

Figure 3: Natural clusters of activated filters and the
respective visual patterns.

The visual patterns produced by the model on Fig. 4.A1
(respectively, 4.A2) are illustrated in Figs. 4.C1, 4.E1, and
4.G1 (resp., Figs. 4.C2, 4.E2, and 4.G2). In both cases, the
clustering of activated filters produced three collections of
filters as shown in Fig. 4.

5.1.  Clustering  of  activated  filters 

We formulate the problem as the clustering of a dataset
$X = \{i \mid \phi_i \in Active\}$ into a number $N$ of natural
clusters $\xi_0, \xi_1, \ldots, \xi_{N-1}$. We call clusters natural
if the membership is determined fairly well in a natural way by
the data.

This clustering is reduced to a sequence of stages of simpler
partitioning [14]. At each stage $j$, a subset $X^j$ of $X$ is
divided into only two classes (for $j = 0$, $X^0 = X$):

1. a natural cluster $\xi_j$ which contains all the data points
(filters) in $X^j$ which are assigned the same class of a seed
point (filter) $seed_j$, with $seed_j$ being picked randomly from
$X^j$, and

2. the data, $X^j - \xi_j$, still not placed in any of the existing
clusters $\xi_0, \xi_1, \ldots, \xi_j$.

The clarity of separation between clusters, as measured by a
dissimilarity function, will be the criterion by which we derive
a natural cluster $\xi_j$ at stage $j$. The dissimilarity function
is defined in Section 5.1.1. The criterion by which we define a
natural cluster at stage $j$ is presented in Section 5.1.2.

The dynamic process of clustering is stopped at stage $j$ if the
class $X^j - \xi_j$ is the empty set. Otherwise, the process
progresses, and the subset $X^{j+1}$ to be partitioned at stage
$j+1$ is defined as $X^{j+1} = X^j - \xi_j$. Finally, the
natural clusters in Active verifying equations (12)-(14) are
induced as:

$$C_n = \{\phi_i \in Active \mid i \in \xi_{n-1}\} \quad \text{with } n = 1, 2, \ldots, N \qquad (15)$$

and where $N$ denotes the number of clusters into which
$X = \{i \mid \phi_i \in Active\}$ was partitioned, that is,
$X = \bigcup_{n=0}^{N-1} \xi_n$. See Fig. 1 for further
illustration of this analysis.

5.1.1.  Dissimilarity  function 

Let $X^j$ be a subset of data not absorbed in any of the existing
clusters $\xi_0, \xi_1, \ldots, \xi_{j-1}$ at stage $j$ of the
dynamic processing, with $X^0$ being the given data set,
$X^0 = X = \{i \mid \phi_i \in Active\}$. Next we define a graph
$GRAPH^j = (X^j, U^j)$ corresponding to the data subset $X^j$,
with $U^j$ being the set of arcs $u = (k,l)$ between pairs of
points in $X^j$. We associate with each arc $u \in U^j$ a real
number $l(u) \geq 0$, and if $u = (k,l)$, we shall also use the
notation $l_{kl}$ for $l(u)$. Let $l_{kl}$ be the distance from
$k$ to $l$ defined as:

$$l_{kl} = \mathrm{Distance}(\phi_k, \phi_l) \qquad (16)$$

where $\mathrm{Distance}(\phi_k, \phi_l)$ measures the distance
between the filtered responses of filters $\phi_k$ and $\phi_l$ as
given in equation (10). The cost of a path is defined as the
greatest distance between two successive vertices on the path.
Let $\mu(seed_j, k)$ be a set of arcs constituting a path between
two points $seed_j$ and $k$ in $X^j$. And let $l(\mu)$ represent
the cost of $\mu(seed_j, k)$ from $seed_j$ to $k$ defined as
follows:

$$l(\mu) = \max\{ l(u) \mid u \in \mu(seed_j, k) \} \qquad (17)$$

Taking into account that two filters belong to the same cluster
if there exists continuity (i.e., there exists similarity in their
statistics at the same spatial locations) across the responses of
filters in a path between them, the dissimilarity function is
next defined as the cost of the optimum path from a seed
point to every other point on the graph. The optimum path between
two data points $seed_j$ and $k$ is the path $\mu^*(seed_j, k)$
from $seed_j$ to $k$ whose maximum cost $l(\mu^*)$ is minimum:

$$\mu^*(seed_j, k) = \arg\min_{\mu(seed_j, k)} \left[ \max\{ l(u) \mid u \in \mu(seed_j, k) \} \right] \qquad (18)$$


Figure 4: Natural clusters of activated filters and
the respective visual patterns.

Table 1: Comparative results of the RMSE metric and
the computational visual distinctness measure.

Hence, the dissimilarity from the viewpoint of $\mathrm{seed}_j$ to
each $k$ is defined as the cost of the optimum path
$\mu^*(\mathrm{seed}_j, k)$ from $\mathrm{seed}_j$ to $k$:

$$d^{GRAPH}(\mathrm{seed}_j, k) = l(\mu^*) = \max\{\, l(u) \mid u \in \mu^*(\mathrm{seed}_j, k) \,\} \qquad (19)$$

with $\mu^*$ being the optimum path between $\mathrm{seed}_j$ and $k$.
The optimal path algorithm is given in [14].
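
As an illustration, the dissimilarity of equation (19) can be computed
with a Dijkstra-style search in which the cost of a path is its largest
arc rather than the sum of its arcs. The following Python sketch is not
the algorithm of [14]. The function name `minimax_dissimilarity` and the
dense matrix `dist` of pairwise filter distances are assumptions made
for the example.

```python
import heapq
import math


def minimax_dissimilarity(dist, seed):
    """Cost of the optimum (minimax) path from `seed` to every vertex.

    dist : symmetric matrix (list of lists); dist[a][b] is the arc cost l(u).
    Returns a list d with d[k] = min over paths of the largest arc on the path.
    """
    n = len(dist)
    best = [math.inf] * n
    best[seed] = 0.0
    done = [False] * n
    heap = [(0.0, seed)]
    while heap:
        cost, a = heapq.heappop(heap)
        if done[a]:
            continue  # stale heap entry
        done[a] = True
        for b in range(n):
            if b == a or done[b]:
                continue
            # The path cost is the largest arc seen so far, not a sum.
            candidate = max(cost, dist[a][b])
            if candidate < best[b]:
                best[b] = candidate
                heapq.heappush(heap, (candidate, b))
    return best


# Toy example: four points on a line, arc cost = absolute difference.
positions = [0, 1, 2, 10]
dist = [[abs(p - q) for q in positions] for p in positions]
d_graph = minimax_dissimilarity(dist, seed=0)
```

In the toy example the third point is reached through the second
without crossing an arc longer than 1, so its dissimilarity to the seed
is 1 although its direct distance is 2. The isolated fourth point keeps
a large dissimilarity (8).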

5.1.2.  Clarity  of  separation  at  stage  j 

Here we introduce the criterion by which we define the natural
cluster $g_j$ at stage $j$.

The set $\{d^{GRAPH}(\mathrm{seed}_j, k),\ k \in X_j\}$ is first
ordered to obtain a new function:

$$d_j(i) = d^{GRAPH}(\mathrm{seed}_j, k_i) \quad \text{such that} \quad d_j(i) \le d_j(i+1) \qquad (20)$$

where $d_j(i)$ denotes the cost of the optimum path from
$\mathrm{seed}_j$ to $k_i$.

Let $s_j$ represent the degree of closeness that is required
between a pair of points that belong to the natural cluster of
$\mathrm{seed}_j$, denoted $g_j$. Taking into account that $d_j(i)$
measures the closeness between $\mathrm{seed}_j$ and $k_i$, with
$k_i \in X_j$, $s_j$ can be defined as:

$$s_j = d_j(i^*) \qquad (21)$$

with $i^*$ being the location of the first significant rise in the
value of $d_j(i)$ as $i$ increases. The value of $i^*$ is computed
as the first zero crossing of the second derivative of $d_j$, as
described in the Appendix.

A point $k_i$ from $X_j$ is then assigned to the same cluster as
$\mathrm{seed}_j$ if the closeness between $\mathrm{seed}_j$ and $k_i$
is less than or equal to $s_j$:

$$k_i \in g_j \ \text{if}\ d_j(i) \le s_j; \quad \text{otherwise}\ k_i \notin g_j$$
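
The assignment rule can be sketched in Python as follows. The sketch
assumes that $i^*$ has already been located (see the Appendix) and is
passed as a 0-based index into the ordered dissimilarities. The name
`natural_cluster` and the dictionary layout are hypothetical.

```python
def natural_cluster(d_graph, i_star):
    """Return (s_j, members) for one seed.

    d_graph : dict mapping each point k to d^GRAPH(seed_j, k).
    i_star  : 0-based position of the first significant rise in the
              ordered dissimilarities (assumed to be given).
    """
    # Order the dissimilarities so that d_j(i) <= d_j(i + 1), as in eq. (20).
    ordered = sorted(d_graph.items(), key=lambda item: item[1])
    s_j = ordered[i_star][1]  # eq. (21): s_j = d_j(i*)
    members = [k for k, d in ordered if d <= s_j]
    return s_j, members


# Hypothetical dissimilarities from one seed to five filters.
example = {"a": 0.0, "b": 0.10, "c": 0.12, "d": 0.90, "e": 1.00}
s_j, members = natural_cluster(example, i_star=2)
```

With these illustrative values the threshold is 0.12, so filters a, b,
and c join the cluster of the seed while d and e are left for a later
stage.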


6.  PREDICTING  VISUAL  TARGET  DISTINCTNESS 

This  section  presents  a computational  visual  distinctness 
measure  computed  from  the  image  representational  model 
based  on  visual  patterns. 


Figure 5: Cumulative distribution functions of the
search times for the target scenes


The  approach  is  as  follows.  First,  a psychophysical 
experiment  is  performed  in  which  observers  estimate  the 
visual  distinctness  of  the  target  in  each  of  44  different  test 
scenes  (Section  6.2.).  Second,  a computational  measure  is 
defined  and  then  applied  to  quantify  the  visual  distinctness  of 
the  targets  (Section  6.3.).  Finally,  an  experiment  is 
performed  to  investigate  the  relation  between  the 
computational  distinctness  measure  and  the  visual  target 
distinctness  measured  by  human  observers  (Section  6.4.). 

6.1.  Images 

The  images  used  in  this  study  are  slides  made  during  the 
DISSTAF  (Distributed  Interactive  Simulation,  Search  and 
Target  Acquisition  Fidelity)  field  test,  that  was  designed  and 
organized  by  NVESD  (Night  Vision  & Electro-optic  Sensors 
Directorate,  Ft.  Belvoir,  VA,  USA)  and  that  was  held  in  May 
and  June  1995  in  Fort  Hunter  Liggett,  California,  USA 
[15]. These  slides  depict  44  different  scenes. 

Each  scene  represents  a military  vehicle  in  a complex  rural 
background. The nine different vehicles that are deployed as
search targets are, respectively, a BMP-1, a BTR-70, an
HMMWV-Scout, an HMMWV-Tow, an M1A1, an M3-Bradley,
an M60, an M113, and a T72. The visibility of the targets
varies  throughout  the  entire  stimulus  set.  This  is  mainly  due  to 
variations  in  the  structure  of  the  local  background, 
the  viewing  distance,  the  luminance  distribution  over  the 
target  support  (shadows),  the  orientation  of  the  targets,  and 
the  degree  of  occlusion  of  the  targets  by  vegetation. 

The  images  used  in  the  computational  experiments  are 
subsampled  to  256x256  pixels.  For  each  scene  t,  containing  a 
target  (vehicle),  a corresponding  empty  scene  e was  created 
[6]. The empty scene is everywhere equal to the target scene,
except at the location of the target, where the target support is
filled with the local background. This replacement is done by
hand, using the rubber stamp tool in Photoshop 3.05. The result
is  judged  by  eye  and  is  accepted  if  the  variation  in  the 
background  over  the  target  support  area  does  not  appear  to 
have  an  appreciable  contrast  with  the  natural  variation  in  the 
local  background. 

In the experiment reported here (Section 6.4), the digital
images were (see Figs. 6-9): (i) ten complex natural images
containing a single target that correspond to scenes 16, 9,
37, 6, 30, 26, 29, 3, 21, and 11 from the 44 slides made
during the DISSTAF field test; and (ii) the corresponding
empty images of the same rural backgrounds without target,
which were created using the rubber stamp tool in Photoshop
3.05.

For each target image, Figs. 6-9 illustrate the simple
thresholding of the visual pattern produced by the natural
cluster of filters in Active that segregates the military vehicle
(target detector). Simple thresholding was applied to remove
small response values which were present in the output of the
target detector. The same figures also show the thresholded
visual pattern produced by the target detector when it is
applied to the respective image without target.

To produce the results shown in Figs. 6-9, the definition of
integral feature used for the partitioning of the visual patterns
in accord with a constraint of invariance was as follows
(Section 4.1):

• for  the  target  scenes  30,  37,  and  16.  T = (Th  T},  Ts): 

• for 26 and 6, $T = (T_5)$;

• for  9 ,T=  (1 ’/,  T2,  TJ; 

• for  29,  T = (T,.  Ts): 

• for  3,7'=  (TV; 

• for 21, $T = (T_4, T_5)$; and

• for  11,7’=  (T/,  TJ- 

Section 6.4 analyzes how the best definition of integral
feature for predicting visual target distinctness can be
estimated from a dataset of examples.


6.2.  Psychophysical  target  distinctness 

A psychophysical experiment was performed in which
observers estimated the visual distinctness of the target.

Search times and cumulative detection probabilities were
measured for nine military targets in complex natural
backgrounds. A total of 64 civilian observers, aged between 18
and 45 years, participated in the visual search experiment.

The procedure of the search experiment is described in [6].
Search performance is usually expressed as the cumulative
detection probability as a function of time, and it can be
approximated by [6]:


$$P(t) = \begin{cases} 0, & t < t_0 \\ 1 - \exp\{-(t - t_0)/\rho\}, & t \ge t_0 \end{cases} \qquad (22)$$

where

• $P(t)$ is the fraction of correct detections at time $t$,

• $t_0$ is the minimum time required to respond, and

• $\rho$ is a time constant.
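
For concreteness, equation (22) can be evaluated as in the following
Python sketch. The function name `detection_probability` and the
parameter names `t0` and `rho` (the minimum response time and the time
constant) are chosen for the example. The numerical values are
illustrative and are not fitted to the search data.

```python
import math


def detection_probability(t, t0, rho):
    """Cumulative detection probability of eq. (22)."""
    if t < t0:
        return 0.0  # no response is possible before the minimum time t0
    return 1.0 - math.exp(-(t - t0) / rho)


# Illustrative parameters: t0 = 1 s, time constant = 2 s.
p_before = detection_probability(0.5, t0=1.0, rho=2.0)
p_at_one_constant = detection_probability(3.0, t0=1.0, rho=2.0)
```

One time constant after $t_0$ the modeled fraction of correct
detections is $1 - e^{-1}$, that is, roughly 63%.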


Fig.  5 shows  the  cumulative  distribution  functions 
corresponding  to  the  search  times  measured  for  the  target 
scenes  used  in  the  experiment  here  described.  The  overall 
difference  between  two  of  these  functions  can  be  measured  by 
subtracting  the  area  beneath  their  graphs.  This  operation 
corresponds  to  a Kolmogorov-Smirnov  (K-S)  test.  To  compare 
the  relative  distinctness  of  the  targets  in  the  different  target 
scenes  the  curves  are  rank-ordered  according  to  the  area 
beneath  their  graphs.  The  resulting  rank  order  for  the  target 
scenes  is  listed  in  the  column  with  the  header  Rpt/  in  Table  1. 
These rank orders are adopted as the reference standard for the
evaluation of the computational metric.
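
A minimal sketch of this rank ordering is given below in Python. It
assumes that the cumulative detection curve of a scene is the empirical
distribution of the observers' search times, integrated up to a cut-off
`t_max`, and that a larger area (faster detection) means a more
distinct target. The function names, the cut-off, and the search times
are hypothetical.

```python
import math


def area_under_cdf(search_times, t_max):
    """Area beneath the empirical cumulative detection curve on [0, t_max].

    A detection at time s adds a step of height 1/n that lasts from s
    to t_max; a miss can be encoded as math.inf and adds nothing.
    """
    n = len(search_times)
    return sum(max(0.0, t_max - s) for s in search_times) / n


def rank_by_area(times_per_scene, t_max):
    """Scene labels ordered from largest to smallest area."""
    areas = {scene: area_under_cdf(times, t_max)
             for scene, times in times_per_scene.items()}
    return sorted(areas, key=areas.get, reverse=True)


# Hypothetical search times (in seconds) for three scenes.
times = {"A": [1.0, 2.0, 3.0], "B": [5.0, 6.0, 7.0], "C": [2.0, 3.0, 4.0]}
order = rank_by_area(times, t_max=10.0)
```

The difference between the areas of two scenes gives the overall
difference between their curves, in the sense used above.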

Targets that give rise to closely spaced cumulative detection
curves, which are similar in accordance with a K-S test, have
similar visual distinctness. Fig. 5 shows that the target images
in the experiment are clustered into a number of sets of
targets with comparable visual distinctness: {16, 9, 37}, {6,
30, 26, 29}, and {3, 21, 11}. Consequently, rank order
permutations of elements of the same cluster are not very
significant, whereas rank order permutations of elements of
different clusters are significant.


6.3.  Computational  Target  Distinctness 

Let $C_n = \{\phi_{n,j}\}_j$, with $n = 1, 2, \ldots, N$, be the $N$
natural clusters in Active produced by the RGFF model for the
target image.

Figure 6: Target and empty images. Simple thresholding of the
visual patterns produced by the target detector on them.

Figure 7: Target and empty scenes in the dataset, and the
simple thresholding of the visual patterns produced by the
target detector when it is applied on them. Simple
thresholding was used to remove small response values which
were present in the output of the target detector.


Let $t_n$ represent the visual pattern segregated on the reference
target image $t(x,y)$ by pooling the responses of the filters in the
natural cluster $C_n = \{\phi_{n,j}\}_j$ as follows:

$$t_n = \sum_{j} A_{n,j} \qquad (23)$$


where $A_{n,j}$ denotes the original image $t(x,y)$ filtered through
the logGabor $\phi_{n,j}$ in $C_n$ and passed through a non-linearity
of the form:

$$\tanh(z, \tau) = \frac{1 - \exp\{-z\tau\}}{1 + \exp\{-z\tau\}} \qquad (24)$$

where $\tau$ is a gain term [16]. This nonlinearity enables the
system to respond to local contrast over several log units of
illumination changes.
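
The non-linearity of equation (24) is algebraically identical to the
ordinary hyperbolic tangent evaluated at half the product of the
response and the gain, which gives a numerically stable way of
computing it. A minimal Python sketch follows; the function name
`gain_nonlinearity` is hypothetical and the gain values are
illustrative.

```python
import math


def gain_nonlinearity(z, tau):
    """Eq. (24): (1 - exp(-z*tau)) / (1 + exp(-z*tau)).

    Computed as tanh(z * tau / 2), which is the same function but does
    not overflow for large |z * tau|.
    """
    return math.tanh(0.5 * z * tau)


# Direct evaluation of eq. (24) for comparison at one point.
z, tau = 1.0, 2.0
literal = (1.0 - math.exp(-z * tau)) / (1.0 + math.exp(-z * tau))
stable = gain_nonlinearity(z, tau)
```

Small responses pass almost linearly (with slope proportional to the
gain), while large responses saturate towards 1, which is what
compresses several log units of input into a bounded output.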


Figure  8:  Target  and  empty  scenes  in  the  dataset,  and  the 
simple  thresholding  of  the  respective  visual  patterns  by  the 
target  detector. 


Therefore $t_1, t_2, \ldots, t_N$ represent a decomposition of the
reference target image $t$ into the set of its most significant
visual patterns.

In order to compensate for the effect of image-to-image
variations on the overall image light level, contrast
normalization of each visual pattern is realized by dividing $t_n$
by the sum of all filtered responses in Active, plus a saturation
constant $\sigma$:

$$t_n \leftarrow \frac{t_n}{\sum_{\phi_i \in Active} A_i + \sigma} \qquad (25)$$

where $A_i$ denotes the original image $t(x,y)$ filtered through the
logGabor $\phi_i$ in Active and passed through a non-linearity as
given in equation (24).

Similarly, passing the corresponding empty image $e(x,y)$
through the filters associated with each cluster $C_n$ produced by
the model on the reference image $t(x,y)$ results in a
decomposition of $e$ into $e_1, e_2, \ldots, e_N$.

Let $d_{VP}(t_n, e_n)$ be the difference between the visual patterns
$t_n$ and $e_n$, computed via the $\beta$-norm between their
statistical structure over those pixels which form "fixation points"
on $t_n$ [11]:

$$d_{VP}(t_n, e_n) = \left( \frac{1}{\mathrm{Card}[FP(t_n)]} \sum_{(x,y) \in FP(t_n)} D\left[T^{t_n}(x,y), T^{e_n}(x,y)\right]^{\beta} \right)^{1/\beta} \qquad (26)$$

with $FP(t_n)$ being the set of fixation points for $t_n$, and
$D[T^{t_n}(x,y), T^{e_n}(x,y)]$ defining a normalized distance measure
between the integral features $T^{t_n}(x,y)$ and $T^{e_n}(x,y)$
computed on $t_n$ and $e_n$, respectively. The default value of the
exponent $\beta$ in Equation (26) is 3.

Based on a definition of "visual pattern" as congruence in $T$
across frequency bands, the differences between segregated
visual patterns, $D_n = d_{VP}(t_n, e_n)$, $n = 1, 2, \ldots, N$,
determine the overall distinctness between the reference target
image $t$ and the corresponding empty image $e$ by using a simple
decision rule:

$$VP_T(t, e) = \frac{1}{N} \sum_{n=1}^{N} D_n \qquad (27)$$

A schematic overview of the $VP_T$ distinctness measure is
given in Fig. 10.
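
The two steps of the measure can be sketched in Python as follows. The
sketch assumes that the per-pattern difference is a $\beta$-norm
average of a pointwise distance over the fixation points, and that the
decision rule averages the per-pattern differences. The function names,
the dictionary layout of the features, and the absolute-difference
distance used in the example are hypothetical stand-ins for the
normalized distance between integral features.

```python
def pattern_distance(feat_t, feat_e, fixation_points, distance, beta=3.0):
    """Beta-norm difference between two visual patterns.

    feat_t, feat_e  : dicts mapping a pixel (x, y) to its integral feature.
    fixation_points : non-empty list of pixels (x, y) on the target pattern.
    distance        : callable giving the distance between two features.
    """
    total = sum(distance(feat_t[p], feat_e[p]) ** beta for p in fixation_points)
    return (total / len(fixation_points)) ** (1.0 / beta)


def overall_distinctness(pattern_distances):
    """Decision rule, assumed here to be the mean of the D_n values."""
    return sum(pattern_distances) / len(pattern_distances)


# Toy example with scalar features and two fixation points.
feat_t = {(0, 0): 1.0, (0, 1): 3.0}
feat_e = {(0, 0): 0.0, (0, 1): 1.0}
points = [(0, 0), (0, 1)]
d_1 = pattern_distance(feat_t, feat_e, points, lambda a, b: abs(a - b))
vp = overall_distinctness([d_1, 0.5])
```

With an exponent of 3 the larger pointwise difference dominates the
result, so a target that differs strongly from the background at a few
fixation points is not averaged away by many small differences.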


6.4.  Relation  between  the  computational  and 
psychophysical  target  distinctness  estimates 

All the possible definitions of $T$ were considered by
combining any subset of the following separable features:

• the phase $T_1$,

• the local energy $T_2$,

• the standard deviation of the local energy $T_3$,

• the local contrast of the local energy $T_4$, and

• the entropy of the local energy $T_5$.

For each specific definition of integral feature, denoted $T$,
the notion of congruence in $T$ across frequency bands was
used to decompose the images into their visual patterns.

The $VP_T$ measure was then applied to quantify the
visual distinctness of the targets. The subjective ranking
induced by the psychophysical target distinctness was the
reference rank order.

In order to study the efficacy of each definition $T$ of integral
feature for predicting target distinctness in a complex natural
background, the fraction of correctly classified targets (with
respect to the reference rank order) by the $VP_T$ measure was
computed on the dataset. Targets that give rise to closely
spaced cumulative detection curves, which are similar in
accordance with a Kolmogorov-Smirnov test, have similar
visual distinctness (Section 6.2). Hence, the fraction of
correct classification $P_{CC}$ was defined as:

$$P_{CC} = \frac{\text{Number of Correctly Classified Targets}}{\text{Number of Targets}}$$

where rank order permutations of targets of the same cluster
are insignificant (i.e., they are correctly classified by the
metric), whereas rank order permutations of elements of
different clusters are significant (the targets are then
incorrectly classified).
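
One way to make this definition operational is sketched below in
Python. The sketch counts a target as correctly classified when the
rank position it receives from the metric belongs, in the reference
order, to a target of the same cluster. This reading of the definition,
the function name `fraction_correct`, and the computed orders are
assumptions. The clusters are those of Section 6.2, and the reference
order is taken, as a further assumption, to be the order in which the
scenes are listed in Section 6.1.

```python
def fraction_correct(reference_order, computed_order, clusters):
    """Fraction of targets whose computed rank falls in their own cluster."""
    cluster_of = {target: index
                  for index, members in enumerate(clusters)
                  for target in members}
    # Cluster that owns each rank position in the reference order.
    slot_cluster = [cluster_of[target] for target in reference_order]
    correct = sum(1 for position, target in enumerate(computed_order)
                  if cluster_of[target] == slot_cluster[position])
    return correct / len(reference_order)


reference = [16, 9, 37, 6, 30, 26, 29, 3, 21, 11]
clusters = [{16, 9, 37}, {6, 30, 26, 29}, {3, 21, 11}]

# Hypothetical metric outputs, chosen only to exercise the function.
within_cluster_swap = [9, 16, 37, 6, 30, 26, 29, 3, 21, 11]
across_cluster_swap = [16, 9, 6, 37, 30, 26, 29, 3, 21, 11]
p_within = fraction_correct(reference, within_cluster_swap, clusters)
p_across = fraction_correct(reference, across_cluster_swap, clusters)
```

Swapping two targets of the same cluster leaves the fraction at 1,
whereas exchanging targets 37 and 6, which belong to different
clusters, misclassifies both and lowers it to 0.8.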

The highest value of the fraction of correctly classified targets
($P_{CC} = 0.8$) is obtained by the $VP_T$ measure at
$T = (T_1, T_2, T_3, T_4, T_5)$. Hence, the best definition of integral
feature for predicting target distinctness on the dataset in this
experiment is $T = (T_1, T_2, T_3, T_4, T_5)$.

Figure 9: Target and empty images. Thresholding of the
visual patterns produced by the target detector.

Figure 10: Schematic overview of the computational
distinctness measure

The comparative results of the RMSE metric and the $VP_T$
measure based on the best definition of integral feature for
predicting visual target distinctness are presented in Table 1.
The respective fraction of correct classification is shown at
the bottom of each column. The reference rank order is
listed in column 2.

The target distinctness values and the resulting rank order
computed by the root mean square error (RMSE) metric are
listed in column 3. The RMSE performs poorly, which is to be
expected. Significant rank order permutations are displayed in
boxes. The RMSE metric produces a rank order with five
significant order reversals: targets 16, 26, 29, 3, and 11 are
significantly out of order relative to the reference order
induced by the psychophysical distinctness measure in column
2. The other targets have been attributed rank orders which do
not differ significantly from the reference rank order. The
RMSE yields a relatively low probability ($P_{CC} = 0.5$). These
results show that the RMSE metric appears unable to rank-order
the targets in the dataset with respect to their visual
distinctness.

The target distinctness values and the resulting rank order
computed by the $VP_{(T_1, T_2, T_3, T_4, T_5)}$ measure are listed in
column 4. As noted above, this measure yields the highest
probability ($P_{CC} = 0.8$). This measure induces a rank order with
two significant order reversals: targets 29 and 11 are ordered
incorrectly. The other targets have been attributed rank orders
which do not differ significantly from the reference rank order
based on the psychophysical measure.

Summarizing, for the dataset in this experiment, the $VP_T$
measure with $T = (T_1, T_2, T_3, T_4, T_5)$ appears to compute a
visual target distinctness rank ordering that correlates with
human observer performance.

7.  CONCLUSION 

Here a filtering technique was presented for the automatically
learned partitioning of "visual patterns" in a digital image.
Log-Gabor functions were adopted as an appropriate method
to construct filters of arbitrary bandwidth. The novelty of our
proposal lies in the definition of "visual patterns" as features
which have the highest degree of alignment in statistical
structure across different frequency bands. An interesting
question is what kind of objects, when imaged by cameras, give
rise to the visual patterns that the RGFF model segregates.
They will be objects whose statistical structure across scales
and orientations can be distinguished fairly well from the rest
in a natural way. This limitation of the approach comes from
the following assumption made in the clustering scheme of the
Learning stage (Section 5): the data set of activated filters has
several  separable  clusters  (e.g.,  elongated  and  non-piecewise 
linear  separable  groupings  of  arbitrary  shape,  dense  and  sparse 
natural  clusters)  and  the  membership  is  determined  fairly  well 
in  a natural  way  by  the  data.  The  clarity  of  separation  between 
clusters,  as  measured  by  a dissimilarity  function,  was  the 
criterion  by  which  they  were  derived.  This  assumption  was 
needed  to  deal  with  several  problems:  (a)  to  overcome  the 
lack  of  knowledge  about  the  number  and  size  of  the  clusters 
in  the  data,  (b)  to  avoid  the  dependence  of  clustering  on  the 
initial  cluster  distribution,  and  (c)  to  find  elongated  and  non- 
piecewise  linear  separable  clusters,  as  well  as  to  identify  dense 
and sparse ones. In any case, the existence of natural clusters
in the data is a very realistic assumption for many interesting
applications. For example, because of the differences between
the statistical structure across scales and orientations of
targets and rural background in the application described in
Section 6, the visual distinctness of a man-made object (a
military vehicle) in a rural background can be determined in a
natural way by the data.

Finally, a computational visual distinctness measure was
presented that is computed from the image representational
model based on visual patterns. It was applied to quantify the
visual distinctness of targets in complex natural scenes.

This measure, which applies a simple decision rule to the
distances between segregated visual patterns, was shown
to correlate strongly with the visual distinctness of targets in a
dataset, as estimated by human observers ($P_{CC} = 0.8$ on the
ten test scenes).

APPENDIX

Let $d_j''$ be the second derivative of $d_j$ computed as:

$$d_j''(i) = d_j(i) * \frac{d^2}{di^2} G_s(i)$$

with

$$G_s(i) = A \exp\left\{-\frac{i^2}{2s^2}\right\}$$

and where $d_j$ is convolved with the second derivative of the
Gaussian at scale $s$, denoted $\frac{d^2}{di^2} G_s(i)$, to both
smooth and differentiate the function.

The zero crossings of $d_j''$ correspond to positions at which the
dissimilarity $d_j$ undergoes a significant increment in its value.
To locate the zero crossings marking a rise in $d_j$ due to
inter-cluster differences, the unwanted detail from intra-cluster
differences must be removed by smoothing. The question is:
how much smoothing should be performed? The derivative
should be processed at the scale that best describes the
increments in $d_j$ due to inter-cluster differences, while
removing spurious increments due to intra-cluster
differences. Each interesting structure in $d_j$ comes from a
significant rise in $d_j$ due to inter-cluster differences, and the
best scale for describing the structure should be based on its
intrinsic redundancy across scales, as follows [17].

Because structures of interest exist as significant entities over
a certain range of scales [18,19], one expects to find some
redundancy across the different scales if there exist significant
structures in $d_j$. That is, a significant structure should show
greater similarity across its natural scales (the levels of
resolution at which the structure can be perceived in $d_j$).

Two smoothed versions of $d_j$ at successive scales will be
correlated to the extent that their structures are similar at the
respective scales. We can therefore determine the degree of
similarity by correlating the smoothed versions of $d_j$ at
successive scales.

Let $d^{s(l)}$, with $l = 1, \ldots, L$, be the dissimilarity $d_j$
smoothed by Gaussian kernels at several levels of smoothing $s(l)$
ranging in value from 1 to $s(L)$ and increasing by a constant of 0.5
from one level to the next. Then the normalized redundancy
measure, denoted $\Lambda(s(l))$, between two smoothed versions
$d^{s(l)}$ and $d^{s(l+1)}$ at successive scales $s(l)$ and $s(l+1)$
can be computed by cross-correlating $d^{s(l)}$ and $d^{s(l+1)}$ as
follows:

$$\Lambda(s(l)) = \frac{\langle d^{s(l)}, d^{s(l+1)} \rangle}{\sqrt{\|d^{s(l)}\|^2 \, \|d^{s(l+1)}\|^2}}$$

where $\langle \cdot, \cdot \rangle$ denotes the inner product in the
Hilbert space of measurable, square-integrable one-dimensional
functions, and the norm (energy) of $d^{s(l)}$ is given by
$\|d^{s(l)}\|^2$.

This function $\Lambda(s(l))$ returns a value measuring the relative
redundancy between the respective smoothed versions at two
consecutive scales. Given two smoothed versions, $d^{s(l)}$ and
$d^{s(l+1)}$, at successive degrees of smoothing $s(l)$, $s(l+1)$ of a
signal $d_j$, the value of the normalized function $\Lambda(s(l))$ at
$s(l)$ is fairly small if any essential structure in $d^{s(l)}$ has
been removed from $d^{s(l+1)}$. Hence, each location $s(l)$ of local
minima in $\Lambda(s(l))$ determines a significant scale for
representing a structure of interest in $d_j$ (i.e., a significant
rise in $d_j$ due to inter-cluster differences).

Consequently, in order to locate the zero crossings of $d_j''$
marking a significant rise in $d_j$, the second derivative of $d_j$ is
computed at the smallest scale from the set of locations
$s(l)$ of local minima in $\Lambda(s(l))$. The derivative processed at
the smallest significant scale still describes the increments in
$d_j$ due to inter-cluster differences, while removing spurious
increments due to intra-cluster differences.
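
The procedure of this Appendix can be sketched in Python as follows.
Several choices in the sketch are assumptions and are not taken from
the text: the kernels are truncated at four times the scale, the signal
borders are replicated, the redundancy is the normalized inner product
of two smoothed versions, the smallest scale is used when no local
minimum of the redundancy exists, and a rise is detected as a
positive-to-negative sign change of the second derivative, returning
the index just before the change. All function names are hypothetical.

```python
import math


def gaussian_kernel(s, order=0):
    """Sampled Gaussian at scale s (order=0) or its second derivative (order=2)."""
    radius = int(math.ceil(4 * s))
    kernel = []
    for i in range(-radius, radius + 1):
        g = math.exp(-i * i / (2.0 * s * s)) / (math.sqrt(2.0 * math.pi) * s)
        if order == 2:
            g *= (i * i - s * s) / s ** 4  # d^2/di^2 of the Gaussian
        kernel.append(g)
    return kernel


def convolve_same(signal, kernel):
    """Convolve while replicating the border samples; output has len(signal)."""
    radius = len(kernel) // 2
    last = len(signal) - 1
    out = []
    for i in range(len(signal)):
        acc = 0.0
        for k, w in enumerate(kernel):
            index = min(max(i + radius - k, 0), last)
            acc += w * signal[index]
        out.append(acc)
    return out


def redundancy(a, b):
    """Normalized inner product of two smoothed versions (non-zero signals)."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)


def select_scale(d, scales):
    """Smallest scale at a local minimum of the redundancy, else scales[0]."""
    smoothed = [convolve_same(d, gaussian_kernel(s)) for s in scales]
    lam = [redundancy(smoothed[l], smoothed[l + 1])
           for l in range(len(scales) - 1)]
    minima = [l for l in range(1, len(lam) - 1)
              if lam[l] < lam[l - 1] and lam[l] < lam[l + 1]]
    return scales[minima[0]] if minima else scales[0]


def first_rise_index(d, s):
    """Index just before the first + to - sign change of the second derivative."""
    d2 = convolve_same(d, gaussian_kernel(s, order=2))
    for i in range(len(d2) - 1):
        if d2[i] > 0.0 and d2[i + 1] < 0.0:
            return i
    return None


# Ordered dissimilarities with one clear rise after the sixth point.
d = [0.0] * 6 + [1.0] * 6
scales = [1.0 + 0.5 * l for l in range(7)]
chosen_scale = select_scale(d, scales)
i_star = first_rise_index(d, 1.0)
```

On this step-like example the second derivative at scale 1 changes sign
between the sixth and the seventh sample, so the returned index is that
of the last point before the rise.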


